A chromosome-level genome assembly of the redfin culter (Chanodichthys erythropterus)

Chanodichthys erythropterus is a fierce carnivorous fish widely found in East Asian waters. It is not only a popular food fish in China but also a representative victim of overfishing. Genetic breeding programs launched to meet market demand urgently require a high-quality genome to facilitate genomic selection and genetic research. In this study, we constructed a chromosome-level reference genome of C. erythropterus by combining long-read single-molecule sequencing from Oxford Nanopore Technology (ONT) with Hi-C scaffolding. The 1.085 Gb C. erythropterus genome was assembled from 132 Gb of Nanopore sequence. The assembled genome has a BUSCO completeness of 98.5% and a contig N50 length of 23.29 Mb. Using Hi-C data, the contigs were clustered and ordered onto 24 chromosomes covering roughly 99.49% of the genome assembly. Additionally, 33,041 (98.0%) of a total of 33,706 predicted protein-coding genes were functionally annotated, with gene prediction supported by transcriptome data from seven tissues. This high-quality genome assembly will be a valuable resource for future molecular breeding and functional genomics research on C. erythropterus.


Background & Summary
Chanodichthys erythropterus (Basilewsky, 1855), which belongs to the family Cyprinidae, is widely distributed in East Asia, inhabiting lakes and slow-moving rivers with rich vegetation 1 . It is a small, fierce carnivorous fish: juveniles feed on zooplankton, such as copepods, while adults mainly feed on small fish 2 . C. erythropterus is highly adaptable to its natural environment and is not obviously affected even when living in alkaline lakes such as Hulun Lake 3,4 .
Owing to its delicious and delicate flesh, C. erythropterus is popular with consumers and has a high commercial value 5 . Over the last decade, interest in the aquaculture of C. erythropterus has increased to meet market demand, as wild stocks are under threat from overfishing and water pollution. Whole-genome sequencing of a species is an essential tool for addressing important questions in both biological research and aquaculture. Previous research on C. erythropterus has mostly focused on reproduction, age and growth 6,7 , feeding habits 2 , muscle composition 8 , and population genetics 9 . To date, however, no genomic resources are available for C. erythropterus, which severely hampers research into its phylogeny, evolution and biology. Genomic data and resources can provide a basis for subsequent studies on the species diversity and population dynamics of C. erythropterus, and can provide solid support for the proposal of sound conservation measures.
In the current study, the chromosome-level genome of Chanodichthys erythropterus was constructed using Nanopore sequencing and Hi-C technology. The final genome assembly is approximately 1,085.51 Mb, with a scaffold N50 of 42.39 Mb. Using Hi-C data, we anchored 99.49% of the assembled bases to 24 chromosomes. As a valuable resource for the conservation and breeding management of C. erythropterus, this genome can serve as the genetic basis for future research into its evolution and biology.

Methods
Sampling and sequencing. The C. erythropterus sample used for genome sequencing and assembly was obtained from Hulun Lake (Inner Mongolia, China). Muscle tissue was stored at −80 °C and used for DNA extraction, genomic DNA sequencing, and Hi-C library construction. High-molecular-weight DNA was obtained using a standard SDS extraction method.
Following the manufacturer's recommendations, sequencing libraries were generated using the TruSeq Nano DNA HT Sample Preparation Kit (Illumina, USA), and index codes were added to attribute sequences to each sample. The libraries were sequenced on the Illumina NovaSeq 6000 platform, yielding 150 bp paired-end reads with an insert size of approximately 350 bp. Illumina sequencing produced 41 Gb of raw genomic data for C. erythropterus.
Sequencing was performed on flow cells on the PromethION sequencer according to the manufacturer's instructions. The Nanopore technology yielded 132 Gb of high-quality data from the long-read library, which covered 117.86-fold of the genome assembly.
To obtain a chromosome-level assembly of the genome, a high-throughput chromatin conformation capture (Hi-C) library was built for sequencing 10 , using the original sample as input. Following grinding in liquid nitrogen, cross-linking was carried out with a 4% formaldehyde solution under vacuum for 30 minutes at room temperature. The cross-linking reaction was quenched with 2.5 M glycine for 5 minutes. Nuclei were digested with 100 units of MboI, labelled with biotin-14-dCTP and subsequently ligated with T4 DNA ligase. Following overnight incubation to reverse cross-linking, the ligated DNA was sheared into 200 to 600 bp fragments. The DNA fragments were blunt-end repaired and A-tailed, then purified through biotin-streptavidin-mediated pulldown. Finally, the Hi-C libraries were quantified and sequenced on an Illumina platform in PE150 mode.
RNA was also extracted from seven tissues of C. erythropterus: intestine, liver, muscle, spleen, heart, gallbladder and kidney. Transcriptome sequencing was performed on the Illumina NovaSeq 6000 platform, and the resulting reads were used for gene prediction.
Genome size estimation and contig assembly. The Illumina data were analysed for k-mer depth frequency distribution to estimate the genome size, heterozygosity and repeat content of C. erythropterus. The genome size (G) was estimated according to the following formula: G = k-mer number/k-mer depth, in which the k-mer number and k-mer depth are the total number and average depth of the 17-mers, respectively 11 . The k-mer depth frequency distribution analysis was performed on 41 Gb of clean Illumina data (Fig. 1). On the basis of a total of 30,891,679,507 17-mers and a peak 17-mer depth of 27, the estimated genome size was 1120.68 Mb, the heterozygosity was 0.31%, and the repeat content and GC content were roughly 57.05% and 37.95%, respectively (Table 1).
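The formula above can be written out as a minimal Python sketch. This is not the authors' pipeline; the function name and the conversion to Mb are illustrative assumptions. Note that dividing the reported totals directly gives a raw value of about 1144 Mb, slightly larger than the 1120.68 Mb quoted in the text, which presumably reflects a correction step that is not described in this section.

```python
def estimate_genome_size_mb(kmer_number: int, kmer_depth: float) -> float:
    """Estimate genome size in Mb as G = k-mer number / k-mer depth.

    Hypothetical helper for illustration; not part of any published tool.
    """
    if kmer_depth <= 0:
        raise ValueError("k-mer depth must be positive")
    return kmer_number / kmer_depth / 1e6


# Values reported in the text: total number of 17-mers and peak 17-mer depth.
raw_estimate_mb = estimate_genome_size_mb(30_891_679_507, 27)
print(f"Raw k-mer estimate: {raw_estimate_mb:.2f} Mb")
```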
Fig. 1 17-mer frequency distribution in the C. erythropterus genome. The X-axis is the k-mer depth, and the Y-axis represents the frequency of k-mers at a given depth.
www.nature.com/scientificdata
Using all Nanopore sequencing data, a preliminary assembly of the C. erythropterus genome was performed using the NextDenovo assembler (v2.3.1) (https://github.com/Nextomics/NextDenovo) with the following parameters: "read_cutoff = 1k, pa_correction = 20, sort_options = -m 20g -t 10, correction_options = -p 10". The contig sequences were then corrected with NextPolish (v1.3.1) 12 using the Illumina raw data as well as the Nanopore sequencing data. The NextDenovo assembly and subsequent polishing yielded a genome assembly of 1,085.49 Mb with a contig N50 of 23.28 Mb (Table 2). The length of this assembly is close to the genome size estimated by k-mer analysis.
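For readers unfamiliar with the contig N50 statistic reported here, the following is a minimal Python sketch of the standard definition: the length L such that sequences of length L or longer account for at least half of the total assembly length. The function name and the toy lengths are hypothetical and are not taken from this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths; kept as a guard


# Hypothetical toy contig lengths (bp), not data from this study.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

The same function applies to scaffold lengths, which is how a scaffold N50 differs from a contig N50 only in its input.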
Chromosomal-level genome assembly using Hi-C data. Using the Hi-C scaffolding method 13 , the contigs in the initial assembly were anchored and oriented into a chromosome-scale assembly. The Hi-C library generated 86 Gb of clean data. After the Hi-C-corrected contigs were processed with the ALLHiC pipeline 14 for clustering, ordering and orientation, 99.49% of the assembled sequences were anchored to 24 pseudochromosomes with lengths ranging from 31.72 Mb to 73.07 Mb (Table 3). This result agrees with karyotype results based on cytological observations 15 and with the chromosome number of 2n = 48 found in many cyprinid fishes such as Ctenopharyngodon idellus 16 , Ancherythroculter nigrocauda 17 , Hypophthalmichthys molitrix and Hypophthalmichthys nobilis 18 . We further manually curated the Hi-C scaffolding using the chromatin contact matrix in Juicebox (Fig. 2). The 24 pseudochromosomes are easily distinguishable in the heatmap, and the interaction signal around the diagonal is strong, indicating the high quality of this genome assembly. Following Hi-C correction, the final assembled genome was 1,085.51 Mb and the scaffold N50 was 42.39 Mb (Table 2). The genome size of C. erythropterus is similar to those of other cyprinid fishes such as Ctenopharyngodon idellus (1.07 Gb), Megalobrama amblycephala (1.09 Gb) 19 , Culter alburnus (1.02 Gb) 19 , and Ancherythroculter nigrocauda (1.04 Gb), but much smaller than that of Cyprinus carpio (1.69 Gb) 20 .
Assessment of the genome assemblies. To evaluate the accuracy and completeness of the genome assembly, we first aligned the Illumina reads to the C. erythropterus assembly with BWA (v0.7.8) 21 ; 98.71% of the reads mapped to the contigs. We then assessed the completeness of the assembly with Benchmarking Universal Single-Copy Orthologs (BUSCO v5.2.1) 22 using the vertebrata_odb10 database, and with CEGMA (v2.5) 23 . BUSCO showed that the assembly contained 98.5% complete and 0.4% fragmented conserved single-copy orthologs (Table 4), and CEGMA recovered 97.98% of the 248 core eukaryotic genes. Together, these assessments indicate that the C. erythropterus genome assembly is complete and of high quality.
Repeat annotation. To annotate repetitive elements in the C. erythropterus genome, we combined homology-based comparison and ab initio prediction. For ab initio repeat annotation, a de novo repetitive element database was constructed using LTR_FINDER (v1.0.7) 24 , RepeatScout (v1.0.5) 25 and RepeatModeler (v1.0.8) 26 , and RepeatMasker (v4.0.5) 26 was used to annotate the repeat elements in this database. RepeatMasker and RepeatProteinMask (v4.0.5) were then used to identify known repeat element types by searching the Repbase database 27 . In addition, TRF (v4.07b) 28 was used to annotate tandem repeats. In total, we identified 557 Mb of repetitive sequences, accounting for 51.34% of the assembled genome. This proportion is higher than in the Ctenopharyngodon idellus genome (38.06%) and the Megalobrama amblycephala genome (38.68%), but slightly lower than in the Danio rerio genome (52.2%). Among the repeats, LTR elements were the dominant class, totalling 469 Mb (43.23% of the assembled genome) (Table 5).
The predicted genes of C. erythropterus were functionally annotated using BLAST 39 against the SwissProt 40 , NCBI Nr, KEGG 41 , InterPro 42 , GO 43 , and Pfam 44 databases with an e-value cutoff of 1E-5. InterProScan (v4.8) 45 was used to predict protein function based on conserved protein domains in the InterPro database. In total, 33,041 genes were successfully annotated for C. erythropterus, representing 98.0% of all predicted genes (Table 8 and Fig. 4).
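The annotation rate quoted above follows directly from the two gene counts given in the abstract. As a small illustrative check in Python (the function name is a hypothetical helper, not part of any annotation tool):

```python
def annotated_percentage(annotated: int, predicted: int) -> float:
    """Return the share of predicted genes with a functional annotation, in percent."""
    if predicted <= 0:
        raise ValueError("number of predicted genes must be positive")
    return 100.0 * annotated / predicted


# Counts reported in the text: 33,041 annotated out of 33,706 predicted genes.
rate = annotated_percentage(33_041, 33_706)
print(f"Functionally annotated: {rate:.1f}%")
```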
Finally, miRNAs and snRNAs were identified by searching the Rfam database using INFERNAL 46 with default parameters. Human rRNA sequences were chosen as a reference, and BLAST 39 was used to predict the rRNA sequences of C. erythropterus. The tRNAs were predicted using tRNAscan-SE 47 . As a result, we annotated 1,609 miRNA, 8,135 tRNA, 1,251 rRNA and 1,060 snRNA genes (Table 9).

Data records
The genomic Illumina sequencing data were deposited in the Sequence Read Archive at NCBI SRR18691804 48 -SRR18691805 49 .
The genomic Nanopore sequencing data were deposited in the Sequence Read Archive at NCBI SRR18828942 50 .
The transcriptome Illumina sequencing data were deposited in the Sequence Read Archive at NCBI SRR18697292 51 -SRR18697298.
The Hi-C sequencing data were deposited in the Sequence Read Archive at NCBI SRR18696935 52 . The final chromosome assembly was deposited in GenBank at NCBI JALPSW000000000 53 .
The annotation results of repeated sequences, gene structure and functional prediction were deposited in the Figshare database 54 .

Technical Validation
DNA concentration and quality were assessed using a Qubit Fluorometer and agarose gel electrophoresis, and the A260/A280 absorbance ratio was approximately 1.8.

Code availability
No specific code or script was used in this work. The commands used in the processing were all executed according to the manuals and protocols of the corresponding bioinformatics software.
Table 9. Classification of ncRNAs in C. erythropterus genome.